Early Gestation RNAseq

Load Packages

First load all packages required for this report. Each package is version controlled through the user of an R environment. This ensures that the report and Docker image are built with the correct version of R and packages from both CRAN and bioconductor. We pull from a renv cache so that the Bioconductor packages are appropriately loaded.

library(dplyr)
library(readr)
library(stringr)
library(tidyr)
library(tibble)
library(magrittr)
library(rmdformats)
library(magrittr)
library(edgeR)
library(DESeq2)
library(ggplot2)
library(ggfortify)
library(cowplot)
library(AnnotationHub)
library(ensembldb)
library(DT)
library(quantro)
library(qsmooth)
library(plotly)
library(here)

Annotations

First need to import annotations, the GRCh37 build of the human genome was used to align the RNA-seq data and so the GRCh37 annotations are used here.

For the annotations I elected to use the more recent annotations (GRCh38) for gene and transcript length. Recent length annotations are more reliable than previous versions and they have been used in this analysis. First, load the annotations, these may be sourced from the AnnotationHub cache.

ah <- AnnotationHub()

ensDb <- ah[["AH78783"]]
## loading from cache
genesGR <- ensembldb::genes(ensDb)
transGR <- transcripts(ensDb)
transGR$gene_name <- mcols(genesGR)[mcols(transGR)$gene_id,"gene_name"]
transGR$length <- lengthOf(ensDb, "tx")

Next we pull the transcript label annotations. These are interchangeable with transcript IDs, but offer cleaner visualisations.

ensDb <- ah[["AH10684"]]
## loading from cache
transcript_names <- ensDb %>%
  as.data.frame() %>%
  dplyr::filter(type == "transcript") %>%
  dplyr::select("transcript_id",
                "transcript_name")

Finally, we can join our annotation information together into final objects for both gene and transcript level annotations.

grch38_txp_anno <- transGR %>%
  as.data.frame() %>%
  dplyr::select("gene_id",
                "gene_name",
                "transcript_id" = "tx_id",
                "width",
                "length",
                "chromosome" = "seqnames",
                "transcript_biotype" = "tx_biotype") %>%
  left_join(transcript_names,
            by = "transcript_id") %>%
  as_tibble()

grch38_gene_anno <- genesGR %>%
  as.data.frame() %>%
  dplyr::select("gene_id",
                "gene_name",
                "length" = "width",
                "chromosome" = "seqnames",
                "gene_biotype") %>%
  as_tibble()

Data import

Now it’s time to import the RNA-seq data. However, samples that were not pure chorionic villus tissue need to be removed from further analyses. Two other samples also had extremely poor sequencing quality and will also be removed. The samples are removed to mitigate any technical or systematic bias in the results.

impure_samples <- c(
  "PAC006", "PAC008", "PAC024", "PAC025",
  "PAC034", "PAC035", "PAC039", "PAC041", 
  "PAC036", "PAC045", "PAC071", "PAC131"
  )

Load in the patient metadata. The Trimester and Timepoint fields have been mutated into the metadata. Trimester defines whether the gestational age falls within the first (0 - 12 weeks) or second (13 - 26 weeks) trimesters of pregnancy.

pd <- here("data/metadata.csv") %>%
  read_csv() %>%
  dplyr::filter(Cohort == "PAC") %>%
  mutate(
    Trimester = as.factor(Trimester),
    Timepoint = ifelse(GestationalAge <= 10, "6-10weeks", "11-23weeks"),
    Timepoint = factor(Timepoint, levels = c("6-10weeks", "11-23weeks"))
  )
## Rows: 125 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): ID, Cohort, Trimester, FetalSex, Oxygen, filenames
## dbl (2): GestationalAge, RIN
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Load in the transcript counts returned from quantification with salmon after implementation of the Selective Alignment method. The counts were imported into R through the use of the catchSalmon function in edgeR prior to this analysis.

transCounts <- readRDS(here("data/salmon_object.rds")) 

rownames(transCounts$counts) <-  gsub(
  "\\..*",
  "",
  rownames(transCounts$counts)
)

rownames(transCounts$annotation) <-  gsub(
  "\\..*",
  "",
  rownames(transCounts$annotation)
)

colnames(transCounts$counts) <- colnames(transCounts$counts) %>%
  basename() %>%
  enframe(name = NULL) %>%
  inner_join(pd, 
             by = c("value" = "filenames")) %>%
  .[["ID"]]

Data Preparation

By plotting the ratio of male:female samples in our data at each gestational age, it is clear that genes on sex chromosomes may be overrepresented across each of ghe timepoints. Hence they are removed in the following code.

autosomeTxpID <- grch38_txp_anno %>%
  dplyr::filter(
    !chromosome %in% c("MT", "X", "Y")
    ) %>%
  as.matrix() %>%
  as.character() %>%
  intersect(
    rownames(transCounts$counts)
    )

transCounts$counts <- transCounts$counts[autosomeTxpID,]
transCounts$annotation <- transCounts$annotation[autosomeTxpID,]

Gene-level analysis

Form gene counts

Now we will aggregate transcript level aboundances into gene counts. Transcript abundances are simply fractions of the full gene count and so merging them offers estimates for gene expression.

txp_mat <- transCounts$counts
  
gene_counts <- txp_mat %>%
  as.data.frame() %>% 
  rownames_to_column(
    var = "transcript_id"
  ) %>%
  left_join(
    grch38_txp_anno,
    by = "transcript_id"
  ) %>%
  dplyr::select(
    -"transcript_id"
  ) %>%
  group_by(gene_id) %>%
  summarise_if(
    is.numeric,
    sum
  ) %>%
  dplyr::select(
    -length,
    -width
  ) %>%
  dplyr::filter(
    !is.na(gene_id)
  ) %>%
  as.data.frame() %>%
  column_to_rownames("gene_id") %>%
  as.matrix() %>% 
  round(0)

Filtering

Next we wish to filter the data to remove low expressed genes. This serves to reduce the number of genes that are used as input for differential gene expression multiple hypothesis testing. Too many genes will result in an inflated false discovery rate. As a big portion of genes will have such low counts that they can be disregarded, we run the risk of sacrificing detection of genes with biological relevance to account for genes with no impact on biology.

filterCounts <- function (
  x,
  minCPM = 2,
  minSamp = 27
) {
  cpm <- cpm(x)
  i <- rowSums(cpm > minCPM) > minSamp
  x[i,]
}

Here is a visual on how the filtering works and its purpose. Our linear modelling of gene expression assumes our data is normally distributed. By removing low expressed genes we remove the noise and leave a clean normal distribution.

unfilt_genes <- dim(gene_counts)[1]

unfilt <- gene_counts %>%
  cpm(log = TRUE) %>%
  as.data.frame() %>%
  pivot_longer(cols = colnames(gene_counts),
               names_to = "sample_ID",
               values_to = "cpm") %>%
  ggplot(aes(x = cpm, colour = sample_ID)) +
  geom_density() +
  labs(x = "Gene Expression (log2 CPM)",
       y = "Density",
       title = paste0("Unfiltered Gene Expression: n = ", unfilt_genes)) +
  theme_bw() +
  theme(axis.title = element_text(colour = "black", size = 14),
        axis.text = element_text(colour = "black", size = 12),
        legend.position = "none")

filt_genes <- dim(filterCounts(gene_counts))[1]

filt <- gene_counts %>%
  filterCounts() %>%
  cpm(log = TRUE) %>%
  as.data.frame() %>%
  pivot_longer(cols = colnames(gene_counts),
               names_to = "sample_ID",
               values_to = "cpm") %>%
  ggplot(aes(x = cpm, colour = sample_ID)) +
  geom_density() +
  labs(x = "Gene Expression (log2 CPM)",
       y = "Density",
       title = paste0("Filtered Gene Expression: n = ", filt_genes)) +
  theme_bw() +
  theme(axis.title = element_text(colour = "black", size = 14),
        axis.text = element_text(colour = "black", size = 12),
        legend.position = "none")

cowplot::plot_grid(unfilt, filt)

Remove alternate builds

A genome build may contain genes that originate from alternate builds, which results in certain genes being counted twice or many times. We simply want the standard build as too many of the same gene may confuse our results.

dge_counts <- gene_counts %>%
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  dplyr::filter(str_detect(gene_id, "^E")) %>%
  as.data.frame() %>%
  column_to_rownames("gene_id") %>%
  as.matrix() %>%
  filterCounts() %>%
  .[,!colnames(.) %in% impure_samples]

Create DGEList

With the gene counts, metadate, and annotations handy, we can use them to form our DGEList. This is similar to a Summarised Experiment or even a ‘data cube’ where each table informs the counts table based on the gene ID match.

DGElist <- dge_counts %>%
  DGEList(
    samples = colnames(.) %>%
      enframe(name = NULL, value = "ID") %>%
      left_join(pd),
    genes = rownames(.) %>%
      enframe(name = NULL, value = "gene_id") %>%
      left_join(grch38_gene_anno,
                by = "gene_id"))
## Joining with `by = join_by(ID)`

Smooth quantile normalisation

The quantro package can be used to test for global differences in distributions between groups to guide the choice of whether it is appropriate to use global normalization methods, such as quantile normalisation. For instance, if the variability between groups is larger than the variability within each group then there may be global differences present rendering quantile normalisation inappropriate (depending on the type and source of variation)

As described in the quantro paper the main function performs two tests. * An ANOVA to test if the medians of distributions are different across groups. Differences across groups could be attributed to unwanted technical variation (such as batch effects) or real global biological variation. This is a helpful step for the user to verify if there is any technical variation unaccounted for.

  • A test for global differences between the distributions across groups which returns a test statistic called quantroStat. This test statistic is a ratio of two variances (similar to the idea of ANOVA): the variability of the distributions within groups relative to the variability between groups. If the variability between groups is sufficiently larger than the variability within groups, then this suggests global adjustment methods may not be appropriate.

The reader is encouraged to look to the original paper for more information if desired.

eset <- ExpressionSet(
  assayData = round(DGElist$counts, 0),
  phenoData = AnnotatedDataFrame(
    pd %>%
      as.data.frame() %>%
      set_rownames(.$ID) %>%
      .[colnames(DGElist$counts),] 
    )
  )

keepMeID <- sapply(
  1:nrow(eset),
  function (x) { any(exprs(eset)[x,] != 0) }
)
eset <- eset[keepMeID,]

ds <- DESeqDataSetFromMatrix(
  countData = eset,
  colData = pData(eset),
  design = ~ Timepoint
)
## converting counts to integer mode
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
r_ds <- vst(ds)
r_ds_mat <- as.matrix(assay(r_ds))

matdensity(
  r_ds_mat,
  groupFactor = pData(eset)$Timepoint,
  xlab = "counts",
  ylab = "density",
  main = "Gene counts",
  brewer.n = 8,
  brewer.name = "Dark2"
  )
legend(
  'top',
  c("NeuN_neg", "NeuN_pos"),
  col = c(1, 2),
  lty = 1,
  lwd = 3
)

The results of the quantro test

qFit <- quantro(
  r_ds_mat, 
  groupFactor = pData(eset)$Timepoint,
  B = 1000
)
## [quantro] Average medians of the distributions are 
##                         not equal across groups.
## [quantro] Calculating the quantro test statistic.
## [quantro] Starting permutation testing.
## [quantro] Using a single core (backend: doSEQ, 
##                         version: 1.5.2).
print(qFit)
## quantro: Test for global differences in distributions
##    nGroups:  2 
##    nTotSamples:  84 
##    nSamplesinGroups:  27 57 
##    anovaPval:  0 
##    quantroStat:  15.25473 
##    quantroPvalPerm:  < 0.001

Smooth quantile normalisation is applied as the differences between distributions aren’t greatly exceeding sample variability. Smooth quantile normalisation is applied using methods from the qsmooth paper

qs <- qsmooth(
  exprs(eset),
  group_factor = pData(eset)$Timepoint
)

DGElist$counts <- qs@qsmoothData

Transcripts were considered as detectbale if above 2CPM in more than 27 samples.

qsmooth::qsmoothData(qs) %>%
  as.data.frame() %>%
  rownames_to_column("gene") %>%
  pivot_longer(
    cols = starts_with("PAC"),
    names_to = "ID",
    values_to = "logCPM"
  ) %>%
  left_join(pd) %>%
  ggplot(
    aes(
      logCPM,
      stat(density), 
      colour = Timepoint, 
      group = ID
      )
  ) +
  geom_density() +
  scale_x_log10() +
  theme_bw()
## Joining with `by = join_by(ID)`
## Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 41 rows containing non-finite outside the scale range
## (`stat_density()`).

Principal Component Analysis - Gene Counts

A Principal Component Analysis (PCA) is a statistical technique that reduces the dimensionality of a dataset while preserving as much variability as possible. This is achieved by transforming the original variables into a new set of variables called principal components (PC). These capture a specific amount of variability, where the first PC (PC1) contains the most variation, PC2 the next most, and so on. Essentially, the first few PCs retain most of the variation present in the original dataset.

The samples are spread on the first principle component by gestational age, meaning gestational age contributes the most variation in our dataset, which is great since we will be testin by timepoint, which is gestational age.

DGElist$counts %>%
  cpm(log = TRUE) %>%
  t() %>%
  prcomp() %>%
  autoplot(
    data = DGElist$samples,
    colour = "GestationalAge",
    size = 5
  ) +
  ggtitle("Principal Component Analysis of Gene Expression") +
  labs(colour = "Gestational Age\n(weeks)") + 
  scale_colour_viridis_c(option = "magma") +
  theme_bw() +
  theme(axis.title = element_text(size = 14, 
                                  colour = "black"),
        axis.text = element_text(size = 12,
                                 colour = "black"),
        plot.title = element_text(size = 15,
                                  colour = "black",
                                  face = "bold")
  )

Differential Gene Expression Analysis

Now we perform our DGE analysis. A Genewise Negative Binomial Generalized Linear Models with Quasi-Dispersion Estimation is implemented. This model accurately models overdispersion that is inherent to RNA-seq data and provides a dispersion parameter for each gene instead of assuming a general dispersion for each one. The Quasi-Dispersion estimation adjusts for sample-specific variability as well to hone in on true biological noise in the data.

First, we set some parameters to control the effect sizes. alpha sets the FDR threshold for statistical significance. Since we are multiple hypothesis testing thousands of genes, an FDR provides a quality control for Type I error, which are false positives.

The lambda parameter provides a threshold of log fold change before a gene is considered biologically significant. A lambda of 1.4 means only genes with over 2.6 times differential expression will be counted. (Ideally I wanted a 2.5 times difference in cutoff, but 1.3 didn’t give that and I decided to use a relatively rounded number of 1.4 instead of 1.321928)

alpha <- 0.05
lambda <- 1.4

For differential transcript expression we are testing to see if the expression of individual transcripts are exceeding a logFC between the before and after 10 weeks’ samples as specified by \(\lambda\) using the glmQLFit function.

null: logFC is between -\(\lambda\) and +\(\lambda\) alternate: absolute value of logFC is greater than \(\lambda\)

For all analyses for the value \(\lambda\) the value 1.4 was chosen

designMatrix <- model.matrix(
  ~ Timepoint +
    FetalSex,
  data = DGElist$samples
)

colnames(designMatrix) <- colnames(designMatrix) %>%
  str_remove("Timepoint") %>%
  str_remove("FetalSex")

DGElist <- DGElist %>%
  estimateCommonDisp(
    design = designMatrix
  )

oxygenGenes <- glmQLFit(DGElist,
                        design = designMatrix) %>%
  glmTreat(coef = "11-23weeks",
           lfc = log2(lambda)) %>%
  topTags(n = Inf) %>%
  .[["table"]] %>%
  as_tibble() %>%
  dplyr::mutate(DE = FDR < alpha & abs(logFC) > 1) %>%
  dplyr::rename("Gene" = "gene_id",
                "GeneName" = "gene_name",
                "Gene_biotype" = "gene_biotype")

sig_genes <- oxygenGenes %>%
  dplyr::filter(FDR < 0.05 & abs(logFC) > 1) 

A volcano plot shows us the logFC with the FDR to give an overview of all significant transcripts in our results. The x axis shows the logFC while the y axis shows the -log10(FDR) Hover over each point to get info about it

p_gene <- oxygenGenes %>%
  dplyr::mutate(direction = case_when(
    logFC > 0 & DE == TRUE ~ "Up",
    logFC <= 0 & DE == TRUE ~ "Down",
    .default = "Non-significant"
  )) %>%
  ggplot(aes(x = logFC, y = -log10(FDR), colour = direction,
             text = paste0("Gene: ", GeneName,
                           "<br>logFC: ", round(logFC, 2),
                           "<br>FDR: ", formatC(FDR, digits = 2)))) +
  geom_point() +
  scale_colour_manual(values = c("blue", "grey80", "red")) +
  labs(x = "Log2 Fold Change",
       y = "FDR (-log10)") +
  theme_bw() +
  theme(axis.title = element_text(colour = "black", size = 14),
        axis.text = element_text(colour = "black", size = 12),
        legend.position = "none")

  ggplotly(p_gene, tooltip = "text")

Now we have our differentially expressed genes

DT::datatable(sig_genes)

Here are the numbers

print(dim(sig_genes))
## [1] 884  11

Transcript-level analysis

Now we run the previous code back with transcripts

Filtering

We wish to filter the data to remove low expressed genes. This serves to reduce the number of genes that are used as input for differential gene expression multiple hypothesis testing. Too many genes will result in an inflated false discovery rate. As a big portion of genes will have such low counts that they can be disregarded, we run the risk of sacrificing detection of genes with biological relevance to account for genes with no impact on biology.

filterCounts <- function (
  x,
  minCPM = 2,
  minSamp = 27
) {
  cpm <- cpm(x)
  i <- rowSums(cpm > minCPM) > minSamp
  x[i,]
}

Here is a visual on how the filtering works and its purpose. Our linear modelling of transcript expression assumes our data is normally distributed. By removing low expressed transcripts we remove the noise and leave a clean normal distribution.

tx_counts <- transCounts %>%
 with(counts / annotation$Overdispersion)

unfilt_tx <- dim(tx_counts)[1]

unfilt_tx_exp <- tx_counts %>%
  cpm(log = TRUE) %>%
  as.data.frame() %>%
  pivot_longer(cols = colnames(tx_counts),
               names_to = "sample_ID",
               values_to = "cpm") %>%
  ggplot(aes(x = cpm, colour = sample_ID)) +
  geom_density() +
  labs(x = "Transcript Expression (log2 CPM)",
       y = "Density",
       title = paste0("Unfiltered Transcript Expression: n = ", unfilt_tx)) +
  theme_bw() +
  theme(axis.title = element_text(colour = "black", size = 14),
        axis.text = element_text(colour = "black", size = 12),
        legend.position = "none")

filt_tx <- dim(filterCounts(tx_counts))[1]

filt_tx_exp <- tx_counts %>%
  filterCounts() %>%
  cpm(log = TRUE) %>%
  as.data.frame() %>%
  pivot_longer(cols = colnames(tx_counts),
               names_to = "sample_ID",
               values_to = "cpm") %>%
  ggplot(aes(x = cpm, colour = sample_ID)) +
  geom_density() +
  labs(x = "Transcript Expression (log2 CPM)",
       y = "Density",
       title = paste0("Filtered Transcript Expression: n = ", filt_tx)) +
  theme_bw() +
  theme(axis.title = element_text(colour = "black", size = 14),
        axis.text = element_text(colour = "black", size = 12),
        legend.position = "none")

cowplot::plot_grid(unfilt_tx_exp, filt_tx_exp)

Remove alternate builds

A genome build may contain genes that originate from alternate builds, which results in certain genes being counted twice or many times. We simply want the standard build as too many of the same gene may confuse our results.

dte_counts <- transCounts %>%
 with(counts / annotation$Overdispersion) %>%
  as.data.frame() %>%
  rownames_to_column("transcript_id") %>%
  dplyr::filter(str_detect(transcript_id, "^E")) %>%
  as.data.frame() %>%
  column_to_rownames("transcript_id") %>%
  as.matrix() %>%
  filterCounts() %>%
  .[,!colnames(.) %in% impure_samples]

Create DTEList

Here we create DTElist and calculate normalisation factors using qsmooth. A common, but uncommonly addressed, question which arises in RNA-seq analyses is if the normalisation methods used in the workflow are appropriate based on the distribution of the reads in the data. There is a risk of normalising the biological differences in the data in certain scenarios which may be unnecessary, and at worst inappropriate, for the data under analysis. Most differential expression methods operate under the assumption that only a small proportion of the genes in question are differentially expressed and that differences in read distributions between testing samples are due to technical issues in the data.

DTElist <- dte_counts %>%
  DGEList(
    samples = colnames(.) %>%
      enframe(name = NULL, value = "ID") %>%
      left_join(pd),
    genes = rownames(.) %>%
      enframe(name = NULL, value = "transcript_id") %>%
      left_join(grch38_txp_anno,
                by = "transcript_id"))
## Joining with `by = join_by(ID)`

Smooth quantile normalisation

The quantro package can be used to test for global differences in distributions between groups to guide the choice of whether it is appropriate to use global normalization methods, such as quantile normalisation. For instance, if the variability between groups is larger than the variability within each group then there may be global differences present rendering quantile normalisation inappropriate (depending on the type and source of variation)

As described in the quantro paper the main function performs two tests. * An ANOVA to test if the medians of distributions are different across groups. Differences across groups could be attributed to unwanted technical variation (such as batch effects) or real global biological variation. This is a helpful step for the user to verify if there is any technical variation unaccounted for.

  • A test for global differences between the distributions across groups which returns a test statistic called quantroStat. This test statistic is a ratio of two variances (similar to the idea of ANOVA): the variability of the distributions within groups relative to the variability between groups. If the variability between groups is sufficiently larger than the variability within groups, then this suggests global adjustment methods may not be appropriate.

The reader is encouraged to look to the original paper for more information if desired.

eset <- ExpressionSet(
  assayData = round(DTElist$counts, 0),
  phenoData = AnnotatedDataFrame(
    pd %>%
      as.data.frame() %>%
      set_rownames(.$ID) %>%
      .[colnames(DTElist$counts),]
    )
  )

keepMeID <- sapply(
  1:nrow(eset),
  function (x) { any(exprs(eset)[x,] != 0) }
)
eset <- eset[keepMeID,]

ds <- DESeqDataSetFromMatrix(
  countData = eset,
  colData = pData(eset),
  design = ~ Timepoint
)
## converting counts to integer mode
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
r_ds <- vst(ds)
r_ds_mat <- as.matrix(assay(r_ds))

matdensity(
  r_ds_mat,
  groupFactor = pData(eset)$Timepoint,
  xlab = "counts",
  ylab = "density",
  main = "Transcript counts",
  brewer.n = 8,
  brewer.name = "Dark2"
  )
legend(
  'top',
  c("NeuN_neg", "NeuN_pos"),
  col = c(1, 2),
  lty = 1,
  lwd = 3
)

The results of the quantro test

qFit <- quantro(
  r_ds_mat, 
  groupFactor = pData(eset)$Timepoint,
  B = 1000
)
## [quantro] Average medians of the distributions are 
##                         not equal across groups.
## [quantro] Calculating the quantro test statistic.
## [quantro] Starting permutation testing.
## [quantro] Using a single core (backend: doSEQ, 
##                         version: 1.5.2).
print(qFit)
## quantro: Test for global differences in distributions
##    nGroups:  2 
##    nTotSamples:  84 
##    nSamplesinGroups:  27 57 
##    anovaPval:  0 
##    quantroStat:  6.67969 
##    quantroPvalPerm:  0.002

Smooth quantile normalisation is applied as the differences between distributions aren’t greatly exceeding sample variability. Smooth quantile normalisation is applied using methods from the qsmooth paper

qs <- qsmooth(
  exprs(eset),
  group_factor = pData(eset)$Timepoint
)

DTElist$counts <- qs@qsmoothData

Transcripts were considered as detectbale if above 2CPM in more than 27 samples.

qsmooth::qsmoothData(qs) %>%
  as.data.frame() %>%
  rownames_to_column("gene") %>%
  pivot_longer(
    cols = starts_with("PAC"),
    names_to = "ID",
    values_to = "logCPM"
  ) %>%
  left_join(pd) %>%
  ggplot(
    aes(
      logCPM,
      stat(density), 
      colour = Timepoint, 
      group = ID
      )
  ) +
  geom_density() +
  scale_x_log10() +
  theme_bw()
## Joining with `by = join_by(ID)`
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 30 rows containing non-finite outside the scale range
## (`stat_density()`).

Principal Component Analysis - Transcript-level

A Principal Component Analysis (PCA) is a statistical technique that reduces the dimensionality of a dataset while preserving as much variability as possible. This is achieved by transforming the original variables into a new set of variables called principal components (PC). These capture a specific amount of variability, where the first PC (PC1) contains the most variation, PC2 the next most, and so on. Essentially, the first few PCs retain most of the variation present in the original dataset.

The samples are spread on the first principle component by gestational age, meaning gestational age contributes the most variation in our dataset, which is great since we will be testin by timepoint, which is gestational age.

fig1 <- DTElist$counts %>%
  cpm(log = TRUE) %>%
  t() %>%
  prcomp() %>%
  autoplot(
    data = DTElist$samples,
    colour = "GestationalAge",
    size = 5
  ) +
  scale_colour_viridis_c(option = "magma") +
  ggtitle("Principal Component Analysis of Transcript Expression") +
  labs(colour = "Gestational Age\n(weeks)") + 
  theme_bw() +
  theme(axis.title = element_text(size = 20, 
                                  colour = "black"),
        legend.title = element_text(size = 20,
                                    colour = "black"),
        legend.text = element_text(size = 18,
                                   colour = "black"),
        axis.text = element_text(size = 18,
                                 colour = "black"),
        plot.title = element_text(size = 20,
                                  colour = "black",
                                  face = "bold")
  )

fig1

Differential Transcript Expression Analysis

Now we perform our DGE analysis. A Genewise (or I guess transcript-wise) Negative Binomial Generalized Linear Models with Quasi-Dispersion Estimation is implemented. This model accurately models overdispersion that is inherent to RNA-seq data and provides a dispersion parameter for each gene instead of assuming a general dispersion for each one. The Quasi-Dispersion estimation adjusts for sample-specific variability as well to hone in on true biological noise in the data.

First, we set some parameters to control the effect sizes. alpha sets the FDR threshold for statistical significance. Since we are multiple hypothesis testing thousands of genes, an FDR provides a quality control for Type I error, which are false positives.

The lambda parameter provides a threshold of log fold change before a gene is considered biologically significant. A lambda of 1.4 means only genes with over 2.6 times differential expression will be counted. (Ideally I wanted a 2.5 times difference in cutoff, but 1.3 didn’t give that and I decided to use a relatively rounded number of 1.4 instead of 1.321928)

alpha <- 0.05
lambda <- 1.4

For differential transcript expression we are testing to see if the expression of individual transcripts are exceeding a logFC between the before and after 10 weeks’ samples as specified by \(\lambda\) using the glmQLFit function.

null: logFC is betweem -\(\lambda\) and +\(\lambda\) alternate: absolute value of logFC is greater than \(\lambda\)

For all analyses for the value \(\lambda\) the value 1.4 was chosen

designMatrix <- model.matrix(
  ~ Timepoint +
    FetalSex,
  data = DTElist$samples
)

colnames(designMatrix) <- colnames(designMatrix) %>%
  str_remove("Timepoint") %>%
  str_remove("FetalSex")


DTElist <- DTElist %>%
  estimateCommonDisp(
    design = designMatrix
  )

oxygenTranscripts <- glmQLFit(DTElist,
                              design = designMatrix) %>%
  glmTreat(coef = "11-23weeks",
           lfc = log2(lambda)) %>%
  topTags(n = Inf) %>%
  .[["table"]] %>%
  as_tibble() %>%
  dplyr::mutate(DE = FDR < alpha & abs(logFC) > 1) %>%
  dplyr::rename("Transcript" = "transcript_id",
                "TranscriptName" = "transcript_name",
                "Transcript_biotype" = "transcript_biotype",
                "Gene" = "gene_id",
                "GeneName" = "gene_name")

sig_tx <- oxygenTranscripts %>%
  dplyr::filter(FDR < 0.05 & abs(logFC) > 1) 

Now pump out the table

DT::datatable(sig_tx)

A volcano plot shows us the logFC with the FDR to give an overview of all significant transcripts in our results. The x axis shows the logFC while the y axis shows the -log10(FDR) Hover over each point to get info about it

p <- oxygenTranscripts %>%
  dplyr::mutate(direction = case_when(
    logFC > 0 & DE == TRUE ~ "Up",
    logFC <= 0 & DE == TRUE ~ "Down",
    .default = "Non-significant"
  )) %>%
  ggplot(aes(x = logFC, y = -log10(FDR), colour = direction,
             text = paste0("Gene: ", GeneName,
                           "<br>Transcript: ", TranscriptName,
                           "<br>logFC: ", round(logFC, 2),
                           "<br>FDR: ", formatC(FDR, digits = 2)))) +
  geom_point() +
  scale_colour_manual(values = c("blue", "grey80", "red")) +
  labs(x = "Log2 Fold Change",
       y = "FDR (-log10)") +
  theme_bw() +
  theme(axis.title = element_text(colour = "black", size = 14),
        axis.text = element_text(colour = "black", size = 12),
        legend.position = "none")

  ggplotly(p, tooltip = "text")

And check numbers here

print(dim(sig_tx))
## [1] 1017   14